Packages

library(tidyverse)
library(moderndive)
library(gridExtra)

Our Population: The “Bowl”

What proportion of the balls are red? We could count all of them (take a census), but that would be a long process.

MD Figure 7.1: A bowl filled with red and white balls.

MD Figure 7.1: A bowl filled with red and white balls.

Our Sample: The “Shovel”

Instead of counting the entire population (census), we can select a random sample to use as an estimate.

MD Figure 7.3: Removing 50 balls from the bowl.

MD Figure 7.3: Removing 50 balls from the bowl.

Random Sampling

Assuming there are more than 50 red balls and 50 white balls, when we take a random sample of 50 balls from the bowl we could get anywhere from 0 to 50 red balls (or a proportion of 0.00 = 0% to 1.00 = 100%). In other words, all proportions are possible. But are they all equally probable? Why or why not?

ANSWER: Not all proportions are equally likely. This is similar to the binomial scenarios we discussed before, in terms of the combinations of red and white. There is only one way to get all red or all white balls, but there are multiple ways to get some red and some white. Therefore, a mix is more likely. Exactly what mix depends on what the fraction of red balls.

Virtual Sampling

We cannot physically replicate this experiment, but we can do so virtually. The moderndive package contains a virtual object bowl that represents the physical object. Scoop one virtual_shovel of balls by running the following code.

virtual_shovel <- bowl %>% 
  rep_sample_n(size = 50, reps = 1)

# each row corresponds to one ball
# replicate = 1 since it's 1 sample

virtual_shovel
# A tibble: 50 x 3
# Groups:   replicate [1]
   replicate ball_ID color
       <int>   <int> <chr>
 1         1     682 white
 2         1    2302 red  
 3         1    1542 white
 4         1    2091 red  
 5         1     510 white
 6         1    1915 white
 7         1    1447 red  
 8         1    2193 red  
 9         1    1288 red  
10         1    1777 white
# … with 40 more rows

The rep_sample_n() function is in package the moderndive.

rep_sample_n(tbl, size, replace = FALSE, reps = 1, prob = NULL)

Defaults are sampling without replacement, equal probability of being selected, and one repetition of the chosen size.

Proportion of Red Balls

The proportion of red balls in your sample is an estimate of the proportion of red balls in the bowl.

# calculate count and proportion of red

virtual_shovel %>% 
  mutate(is_red = (color == "red")) %>% 
  summarize(num_red  = sum(is_red),
            prop_red = mean(is_red))
# A tibble: 1 x 3
  replicate num_red prop_red
      <int>   <int>    <dbl>
1         1      12     0.24
# plot the number of red and white balls

ggplot(virtual_shovel, aes(x = color, fill = color)) +
  geom_bar(color = "black") +
  scale_fill_manual(values = c("red", "white")) +
  theme_linedraw() +
  theme(legend.position = "none")

# visualize what is happening at each step
# break down the code, process by process

# create logical vector for the red balls

virtual_shovel %>% 
  mutate(is_red = (color == "red"))
# A tibble: 50 x 4
# Groups:   replicate [1]
   replicate ball_ID color is_red
       <int>   <int> <chr> <lgl> 
 1         1     682 white FALSE 
 2         1    2302 red   TRUE  
 3         1    1542 white FALSE 
 4         1    2091 red   TRUE  
 5         1     510 white FALSE 
 6         1    1915 white FALSE 
 7         1    1447 red   TRUE  
 8         1    2193 red   TRUE  
 9         1    1288 red   TRUE  
10         1    1777 white FALSE 
# … with 40 more rows
# count the number of TRUE in the vector

virtual_shovel %>% 
  mutate(is_red = (color == "red")) %>% 
  summarize(num_red  = sum(is_red))
# A tibble: 1 x 2
  replicate num_red
      <int>   <int>
1         1      12
# now also calculate proportion of TRUEs

virtual_shovel %>% 
  mutate(is_red = (color == "red")) %>% 
  summarize(num_red  = sum(is_red),
            prop_red = mean(is_red))
# A tibble: 1 x 3
  replicate num_red prop_red
      <int>   <int>    <dbl>
1         1      12     0.24

Note that the code is slightly different than the code found in Modern Dive, Chapter 7. This code is algebraically equivalent, but automatically works for sample sizes other than 50.

What if we wanted to calculate the count and proportion of white balls in the sample? How would you adapt the code?

# does both red and white counts and props

virtual_shovel %>% 
  mutate(is_red = (color == "red"),
         is_white = (color == "white")) %>% 
  summarize(num_red  = sum(is_red),
            prop_red = mean(is_red),
            num_white = sum(is_white),
            prop_white = mean(is_white))
# A tibble: 1 x 5
  replicate num_red prop_red num_white prop_white
      <int>   <int>    <dbl>     <int>      <dbl>
1         1      12     0.24        38       0.76

Taking Multiple Samples

In the physical scenario described in Modern Dive, there were 33 students in the class, each of whom drew one sample from the physical bowl. Their results are shown below.

class_plot <- ggplot(tactile_prop_red, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.5, color = "white", fill = "skyblue") +
  scale_x_continuous(breaks = seq(from = 0, to = 1, by = 0.10),
                     limits = c(0, 1)) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  labs(title = "Distribution of 33 Sample Proportions of Red Balls",
       subtitle = "tactile sampling",
       x = "Proportion of Balls That Were Red (sample size n = 50)", 
       y = "Frequency") +
  theme_linedraw()
class_plot

True Proportion of Red

What seems like the most probable value for \(p\) = proportion of red balls in the bowl, based on the plot of 33 samples? Why?

ANSWER: Most students said the proportion was likely between 0.35 and 0.40, based on where the tallest bar is on the plot and where the center or balance point of the distirbution is likely to be.

(We could calculate it directly, but in most real-life scenarios we cannot. Exploring cases where the solution is known is how we develop reliable methods for unknown cases.)

Virtually Replicate the 33 Samples

Thanks to the power of computers, you can replicate the class sampling process. Set the reps to 33 so you get one sample of size 50 for each virtual student.

# first we each took one sample of size 50
# now we are taking 33 samples of size 50
# run the scoop through the bowl 33 times
# (replace the balls in between scoops)

virtual_shovel <- bowl %>% 
  rep_sample_n(size = 50, reps = 33)

# size = how big is the shovel?
# reps is how many times we use it

# 33 samples, each with 50 balls in each sample
# how many rows will the output have?
# 33 x 50 = 1650, each row is 1 ball

virtual_shovel
# A tibble: 1,650 x 3
# Groups:   replicate [33]
   replicate ball_ID color
       <int>   <int> <chr>
 1         1    1087 white
 2         1    1674 white
 3         1    2141 white
 4         1     799 white
 5         1    1300 red  
 6         1    1529 red  
 7         1     310 red  
 8         1     644 white
 9         1    1152 white
10         1    1851 red  
# … with 1,640 more rows

Calculate the count and proportion of red balls in each sample (replicate) and store it as an object called virtual_prop_red. Print out the results. *Hint: You will need to use group_by() to get separate proportions for each sample (replicate).

# above we had one sample to summarize--one replicate
# now each sample (replicate) is a separate group
# find the proportion of red in each sample

virtual_prop_red <- virtual_shovel %>% 
  group_by(replicate) %>% # group by shovelful / sample
  mutate(is_red = (color == "red")) %>% 
  summarize(num_red  = sum(is_red),
            prop_red = mean(is_red))

# how many rows will the output have?
# there would be 33 rows, one for each sample
# proportions are the "data" we interested in here

virtual_prop_red
# A tibble: 33 x 3
   replicate num_red prop_red
       <int>   <int>    <dbl>
 1         1      16     0.32
 2         2      19     0.38
 3         3      20     0.4 
 4         4      23     0.46
 5         5      13     0.26
 6         6      19     0.38
 7         7      23     0.46
 8         8      12     0.24
 9         9      16     0.32
10        10      17     0.34
# … with 23 more rows

Tactile vs. Virtual Results #1

Tactile results will always be the same. Virtual results will differ unless we set.seed() to control our pseudo-random stream. Run the code below to compare your virtual results to the tactile result.

# take my samples

virtual_shovel <- bowl %>% 
  rep_sample_n(size = 50, reps = 33)

# summarize my samples (find proportions)

virtual_prop_red <- virtual_shovel %>% 
  group_by(replicate) %>% # group by shovelful / sample / 33 times in this case
  mutate(is_red = (color == "red")) %>% 
  summarize(num_red  = sum(is_red),
            prop_red = mean(is_red))

# plot my 33 virtual sample proportions
# examine how the proportions are distributed
# notice we can save a plot as an object

sim_plot <- ggplot(virtual_prop_red, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, 
                 boundary = 0.5, 
                 color = "white", 
                 fill = "skyblue") +
  scale_x_continuous(breaks = seq(from = 0, to = 1, by = 0.10),
                     limits = c(0, 1)) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  labs(title = "Distribution of 33 Sample Proportions of Red Balls",
       subtitle = "virtual sampling",
       x = "Proportion of Balls That Were Red (sample size n = 50)", 
       y = "Frequency") +
  theme_linedraw()

# compare the virtual and tactile proportion plots
# tactile plot is a constant, virtual plot varies

grid.arrange(class_plot, sim_plot)

# to see just my plot alone

sim_plot

Sampling Distribution

We have a distribution of a statistic that comes from repeated samples. In this case, the statistic is the proportion of red balls in each sample. How do those proportions differ from sample to sample?

We need to understand the sampling distribution of the statistic we are using to estimate a population value. What is the shape, center, and spread (variability)?

We can visualize shape with plots, both histograms and other types of plots (e.g., density plots, normal quantile-quantile plots). Usually we characterize center and spread using mean and SD. Calculate those summary values for the simulated proportions stored in virtual_prop_red that we created above.

virtual_prop_red %>% 
  summarize(mean_props = mean(prop_red),
            sd_props = sd(prop_red))
# A tibble: 1 x 2
  mean_props sd_props
       <dbl>    <dbl>
1      0.381   0.0711

So the mean of all my sample proportions is shown above. The average deviation is also shown above.

Can you tell the shape from the plot?

ANSWER: Sometimes it looks kind of bell shaped, but not always…it can vary quite a lot from simulation to simulation.

Taking More Samples!

Our 33 random samples are not quite enough if we are trying to understand the true variability that can occur from sample to sample. There are too many “gaps” in the distribution and it changes too much from simulation to simulation. Let’s take more samples!

Adapt the code to take 1000 samples of size 50. Summarize the count and proportion of red balls in each sample and save it as the object virtual_prop_red_1000.

# take the number of samples I want, of the given size I want
# I will have 50 x 1000 rows in the output, each row is a ball
# each sample has 50 calls in it, and I'm taking 1000 samples
# (running the shovel through the bowl 1000 times, digitally)

virtual_samples_1000 <- bowl %>% 
  rep_sample_n(size = 50, reps = 1000)

# compute the proportion of red balls in each sample
# I will have 1000 rows in my output this time, each 
# row is one sample's proportion of red balls

virtual_prop_red_1000 <- virtual_samples_1000 %>% 
  group_by(replicate) %>%  # group by sample number
  mutate(is_red = (color == "red")) %>% 
  summarize(num_red  = sum(is_red),
            prop_red = mean(is_red))

Virtual Results #2

What shape does your distribution of virtual samples appear to have (approximately)? Run the code below.

ggplot(virtual_prop_red_1000, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, 
                 boundary = 0.5, 
                 color = "white", 
                 fill = "skyblue") +
  scale_x_continuous(breaks = seq(from = 0, to = 1, by = 0.10),
                     limits = c(0, 1)) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  labs(title = "Distribution of 1000 Sample Proportions of Red Balls",
       subtitle = "virtual sampling",
       x = "Proportion of Balls That Were Red (sample size n = 50)", 
       y = "Frequency") +
  theme_linedraw()

Compute the mean and standard deviation of the sample proportions in virtual_prop_red_1000.

virtual_prop_red_1000 %>% 
  summarize(mean_props = mean(prop_red),
            sd_props = sd(prop_red))
# A tibble: 1 x 2
  mean_props sd_props
       <dbl>    <dbl>
1      0.373   0.0679

Putting It all Together

Consider our general algorithm, which we have been repeating throughout. This code is built on what we used above, with a couple tweaks to make it easier to use to explore.

# packages tidyverse, moderndive (or infer)
# set sample size and number of samples
# these values feed into the code below

sample_size <- 100
num_samples <- 1000

# take the number of samples I want, of the size I want
# sample size and number of samples from the input values
# resulting tbl will have sample_size x num_samples rows

virtual_samples <- bowl %>% 
  rep_sample_n(size = sample_size, reps = num_samples)

# compute the proportion of red balls in each sample
# resulting tbl (dataset) will have num_samples rows

virtual_prop_red <- virtual_samples %>% 
  group_by(replicate) %>%  # group by sample number
  mutate(is_red = (color == "red")) %>% 
  summarize(num_red  = sum(is_red),
            prop_red = mean(is_red))

# compute the mean and SD of my sample proportions
# resulting tbl (dataset) will have a single row

virtual_prop_red %>% 
  summarize(mean_props = mean(prop_red),
            sd_props = sd(prop_red))
# A tibble: 1 x 2
  mean_props sd_props
       <dbl>    <dbl>
1      0.373   0.0466
# plot my sample proportions
# plot scaling held constant

ggplot(virtual_prop_red, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, 
                 boundary = 0.5, 
                 color = "white", 
                 fill = "skyblue") +
  scale_x_continuous(breaks = seq(from = 0, to = 1, by = 0.10),
                     limits = c(0, 1)) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  labs(title = "Distribution of Sample Proportions of Red Balls",
       subtitle = sprintf("virtual sampling, sample size = %d, number of samples = %d", sample_size, num_samples),
       x = "Proportion of Balls That Were Red", 
       y = "Frequency") +
  theme_linedraw()

# plot my sample proportions
# variable plot scaling

ggplot(virtual_prop_red, aes(x = prop_red)) +
  geom_histogram(bins = max(10, sqrt(sample_size)), 
                 color = "white", 
                 fill = "skyblue") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  labs(title = "Distribution of Sample Proportions of Red Balls",
       subtitle = sprintf("virtual sampling, sample size = %d, number of samples = %d", sample_size, num_samples),
       x = "Proportion of Balls That Were Red", 
       y = "Frequency") +
  theme_linedraw()

Taking Larger Samples

When sample size goes up, each individual sample is tends to be more like the population (on average) and we get less sampling variability overall, though we can still get an extreme sample.

Use the “putting it all together” code above to explore what happens to shape, center, and spread when sample size changes. The max sample size is 2400 (the number of balls in the bowl). What happens if you take a sample that large?


sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gridExtra_2.3    moderndive_0.5.1 forcats_0.5.1    stringr_1.4.0   
 [5] dplyr_1.0.5      purrr_0.3.4      readr_1.4.0      tidyr_1.1.3     
 [9] tibble_3.1.0     ggplot2_3.3.3    tidyverse_1.3.0 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6           lubridate_1.7.10     formula.tools_1.7.1 
 [4] assertthat_0.2.1     digest_0.6.27        utf8_1.2.1          
 [7] R6_2.5.0             cellranger_1.1.0     backports_1.2.1     
[10] reprex_1.0.0         evaluate_0.14        httr_1.4.2          
[13] highr_0.8            pillar_1.5.1         rlang_0.4.10        
[16] readxl_1.3.1         rstudioapi_0.13      jquerylib_0.1.3     
[19] rmarkdown_2.7        labeling_0.4.2       munsell_0.5.0       
[22] broom_0.7.5          compiler_3.6.0       modelr_0.1.8        
[25] janitor_2.1.0        xfun_0.22            pkgconfig_2.0.3     
[28] htmltools_0.5.1.1    tidyselect_1.1.0     fansi_0.4.2         
[31] crayon_1.4.1         dbplyr_2.1.0         withr_2.4.1         
[34] grid_3.6.0           jsonlite_1.7.2       gtable_0.3.0        
[37] lifecycle_1.0.0      DBI_1.1.1            magrittr_2.0.1      
[40] infer_0.5.4          scales_1.1.1         cli_2.3.1           
[43] stringi_1.5.3        debugme_1.1.0        farver_2.1.0        
[46] fs_1.5.0             snakecase_0.11.0     xml2_1.3.2          
[49] bslib_0.2.4          ellipsis_0.3.1       generics_0.1.0      
[52] vctrs_0.3.6          tools_3.6.0          glue_1.4.2          
[55] hms_1.0.0            yaml_2.2.1           colorspace_2.0-0    
[58] operator.tools_1.6.3 rvest_1.0.0          knitr_1.31          
[61] haven_2.3.1          sass_0.3.1